Week 10.2 - Agents: Failure Modes for Multi-Step / Long-Horizon Tasks

🎯 What We'll Cover

Week 9 gave us a taxonomy for how AI fails: some failures have been patched, some are reduced but persistent, and some are structural — baked into how the systems work and unlikely to be fixed by the next release. That taxonomy was built for single-turn chatbots. This sub-lesson applies it to agents, which fail in genuinely new ways because they do not answer once and stop. They run for many steps, accumulate state, call tools, and act on whatever those tools return.

The single most important idea in this sub-lesson is a distinction: reliability is not accuracy. An agent can be more accurate than ever — better at any individual step — and still be unreliable, because reliability is about doing the whole multi-step task consistently, recovering from its own mistakes, and knowing when it is wrong. A February 2026 Princeton study put numbers on this: across 14 models and 18 months, accuracy improved substantially while reliability barely moved. For a researcher deciding whether to hand a task to an agent, the gap between those two things is exactly where the danger lives.

As always, the framing is the durable part and the examples are the snapshot. The specific failure stories below will date; the three-category question — is this patched, reduced, or structural? — will not.

⏱️ Why Agents Need Their Own Failure Lesson

A chatbot makes one prediction and stops. If it is right 90% of the time, that is roughly your experience of it. An agent runs the model in a loop — dozens or hundreds of times for a single task — and three things change.

Errors compound. If each step is 95% reliable and the task takes twenty independent steps, the probability that every step is correct is 0.95²⁰, or about 36%. High per-step accuracy can still produce a task that usually fails. This is the agentic version of the Week 7 silent-error problem: each step looks plausible, yet the end result is wrong.
State accumulates. The agent carries its own earlier outputs forward as context. An early mistake does not just cost one step — it poisons every step that depends on it, and the agent rarely notices.
The agent acts. A chatbot that hallucinates wastes your time. An agent that hallucinates can delete a file, send an email, or spend money — because it has tools and permissions (the harness components from 10.1). The cost of a failure is no longer just a wrong answer.
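
The compounding arithmetic in the first point is easy to check for yourself. The sketch below assumes every step succeeds or fails independently with the same per-step reliability, which is a simplification (real agent steps are correlated). The function names are illustrative, not taken from any library.

```python
import math


def task_success_probability(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_reliability ** steps


def max_steps_for_target(per_step_reliability: float, target: float) -> int:
    """Largest number of independent steps that keeps whole-task success >= target."""
    if not (0.0 < per_step_reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_step_reliability))


if __name__ == "__main__":
    for steps in (1, 5, 20, 50):
        p = task_success_probability(0.95, steps)
        print(f"{steps:>3} steps at 95% per step: {p:.1%} whole-task success")
    budget = max_steps_for_target(0.95, 0.5)
    print(f"steps affordable at 95% per step for a 50% target: {budget}")
```

Under these assumptions, a 95%-reliable step supports only about thirteen steps before the whole task is more likely to fail than succeed, which is why long tasks are a different problem from short ones.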

📊 Reliability ≠ accuracy — the definition that anchors this week

Accuracy asks: when the agent acts, is the action correct? Reliability asks: across many runs and many steps, does the agent consistently complete the task, recover from its own errors, behave predictably, and avoid unsafe actions? You can improve the first without improving the second. A model that is more accurate per step but no better at the other things is a more capable component inside a system that is no more trustworthy.

🗃️ The Agent Failure Taxonomy

Here is the Week 9.2 taxonomy, applied to agents. As before, the value is in placing a failure you encounter into the right category, because the category tells you what to do about it.

✅ (a) Patched / Largely Solved

Early agent demos (2023–24) failed in ways that have since been largely engineered away by the harness layer:

Malformed tool calls. Early agents frequently emitted tool requests that did not parse. Structured tool-calling APIs and constrained decoding have largely fixed this.
“Forgetting” available tools. Better system prompts and tool-discovery steps (exactly the harness changes in the LangChain worked example from 10.1) have substantially reduced this.
Trivially hallucinated tools. Calling a function that was never provided is now rare on short tasks, though it returns on long ones.

The lesson here mirrors 9.2: if you see one of these, you are probably using an old model or a thin harness. Switch to a current frontier tool and the problem usually disappears.

🛡️ (b) Reduced but Persistent

These are mitigated but resurface, especially on longer tasks. They are the failure modes most relevant to real research workflows in 2026:

Loop instability

Agents get stuck repeating an edit, re-running a failing command, or oscillating between two states. Harnesses now ship explicit loop-detectors (again, see the LangChain recipe in 10.1) — which tells you the problem is real enough to need a dedicated mitigation.
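
To make the idea of a loop-detector concrete, here is a minimal sketch. It is not the LangChain implementation or any real harness's API: the class name, the window size, and the repeat threshold are all assumptions chosen for illustration. It flags an agent when the same action (tool plus arguments) appears several times within a short window of recent steps, which catches both straight repetition and oscillation between two states.

```python
from collections import Counter, deque
from typing import Deque, Tuple

Action = Tuple[str, str]  # (tool name, serialized arguments)


class LoopDetector:
    """Flags an agent that keeps repeating the same action within a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Only the last `window` actions are kept; older ones fall off the left.
        self.recent: Deque[Action] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool: str, args: str) -> bool:
        """Record one action; return True if it now looks like a loop."""
        self.recent.append((tool, args))
        most_common_count = Counter(self.recent).most_common(1)[0][1]
        return most_common_count >= self.max_repeats


if __name__ == "__main__":
    detector = LoopDetector()
    # Hypothetical trace: the agent oscillates between editing and re-running tests.
    trace = [
        ("run_tests", "tests/"),
        ("edit_file", "a.py"),
        ("run_tests", "tests/"),
        ("edit_file", "a.py"),
        ("run_tests", "tests/"),
    ]
    for step, (tool, args) in enumerate(trace, start=1):
        if detector.record(tool, args):
            print(f"possible loop at step {step}: {tool}({args})")
            break
```

A detector like this only notices exact repeats. An agent that makes a slightly different failing edit each time would slip past it, which is one reason the problem is reduced rather than solved.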

Over-eager action

Given write permissions, agents delete more than asked, refactor confidently in the wrong direction, or “fix” things that were not broken. The Week 9 pattern of confidently-defended wrong reasoning becomes confidently-executed wrong action.

Sycophancy over long sessions

The Week 9 sycophancy finding (Sharma et al., 2023) intensifies in long agentic conversations: the longer the session and the more you push back, the more the agent drifts toward agreeing with you rather than serving the task.

Confidence miscalibration over horizons

An agent's sense of “I've got this” does not decay as a task gets longer and riskier, even though its actual success rate does. The confidence signal you might want to threshold on is least trustworthy exactly when you need it most.

🧨 (c) Structural and Likely Persistent

These follow from how agents work, not from fixable engineering gaps. They are the ones to design your workflow around, because the next model release will not remove them. Each gets its own section below: the reliability gap, long-horizon planning collapse, and prompt injection. Compositional brittleness — the compounding-error problem from the top of this lesson, and the same effect behind ProgramBench scoring 0% on whole-repository tasks in Week 9 — sits here too.

📉 Reliability Is Not Accuracy: The Princeton Finding

The structural case is made most rigorously by a February 2026 paper from Princeton's Center for Information Technology Policy: Towards a Science of AI Agent Reliability (Rabanser, Kapoor, Kirgis, Liu, Utpala & Narayanan). It is the agent-era counterpart to the Kalai et al. hallucination paper we met in Week 9 — a careful argument that a whole class of failure is structural, not a passing engineering problem.

The authors evaluated 14 models from OpenAI, Google, and Anthropic spanning 18 months of releases, across roughly 500 benchmark runs. Their headline finding is the one in this section's title:

Accuracy went up. Reliability did not.

“Reliability has improved only modestly over 18 months, while accuracy improved substantially.” Compressing an agent's behaviour into a single success score, the authors argue, hides the operational flaws that actually determine whether you can depend on it.

They decompose reliability into four dimensions — consistency, robustness, predictability, and safety — measured by twelve concrete metrics. The weakest, across the board, is predictability: “agents are not good at knowing when they're wrong.” On one benchmark, most models could not distinguish their own correct answers from their incorrect ones better than chance.

Two numbers make this concrete. Outcome consistency scores ranged from 30% to 75% across the models tested — meaning that running the same agent on the same task under identical conditions often produced different outcomes. And because predictability is so weak, you cannot use the agent's own confidence as a filter: the “ask it only when it's sure” strategy fails, because it is not reliably sure when it should be.
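
You can estimate both quantities for your own task with a handful of repeated runs. The sketch below is one simple way to operationalise them and is an assumption on our part: the paper's own metric definitions may differ. `pairwise_agreement` is a stand-in for outcome consistency, and `confidence_auroc` asks how often a correct run received higher stated confidence than an incorrect one, where 0.5 means the confidence signal is no better than chance.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(outcomes: Sequence[bool]) -> float:
    """Fraction of run pairs with the same outcome (1.0 = perfectly consistent)."""
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(a == b for a, b in pairs) / len(pairs)


def confidence_auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Chance that a correct run got higher confidence than an incorrect one.

    0.5 means the confidence signal is no better than chance; ties count half.
    """
    wins = [c for c, ok in zip(confidences, correct) if ok]
    losses = [c for c, ok in zip(confidences, correct) if not ok]
    if not wins or not losses:
        raise ValueError("need at least one correct and one incorrect run")
    score = sum(
        1.0 if w > l else 0.5 if w == l else 0.0 for w in wins for l in losses
    )
    return score / (len(wins) * len(losses))


if __name__ == "__main__":
    # Hypothetical data: five runs of the same agent on the same task.
    outcomes = [True, False, True, True, False]
    stated_confidence = [0.90, 0.90, 0.80, 0.95, 0.85]
    print(f"pairwise agreement: {pairwise_agreement(outcomes):.2f}")
    print(f"confidence AUROC:   {confidence_auroc(stated_confidence, outcomes):.2f}")
```

Five runs is far too few for a stable estimate, but even a rough number tells you whether a single successful run deserves your trust.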

⚠️ What this means for delegating a task

If you run an agent once and it succeeds, that is weak evidence it will succeed next time: the outcome consistency scores above ran as low as 30%. And you cannot lean on the agent to tell you when it has failed, because knowing when it is wrong is its weakest skill. Both facts push in the same direction: the verification burden stays with you, and it is heavier for agents than for chatbots, not lighter. This is the Week 9 verification lesson, scaled up.

🧭 Long-Horizon Planning Collapse

Why do long tasks fail so distinctively? A January 2026 paper, Why Reasoning Fails to Plan (Wang et al.), gives a structural answer that is worth understanding because it explains a failure you will see repeatedly.

The argument: the step-by-step reasoning that makes models good at short tasks is, formally, a greedy strategy — it picks the locally best next step. Greedy strategies are fine over short horizons. Over long ones they fail, because an early choice that looks locally optimal commits the agent to a path whose costs only show up much later, and those early commitments get amplified over time and are difficult to recover from. The authors prove the gap is structural, not a matter of model capability: a more powerful reasoner making step-wise decisions is still fundamentally limited on long-horizon tasks.
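
The greedy failure is easy to reproduce on a toy problem. The sketch below is not FLARE and not the paper's setup: the task graph, rewards, and function names are invented for illustration, and the "planner" is plain exhaustive search, which only works because the toy is tiny. It shows an early choice that looks best locally committing the agent to a worse overall outcome.

```python
from typing import Dict, List, Tuple

# Hypothetical toy task: state -> {action: (immediate reward, next state)}.
# "shortcut" looks better now but leads to a dead end; "groundwork" pays off later.
TASK: Dict[str, Dict[str, Tuple[float, str]]] = {
    "start": {"shortcut": (5.0, "dead_end"), "groundwork": (1.0, "prepared")},
    "dead_end": {"patch": (0.0, "done")},
    "prepared": {"build": (10.0, "done")},
    "done": {},
}


def greedy_plan(state: str) -> Tuple[float, List[str]]:
    """Always take the action with the best immediate reward."""
    total, path = 0.0, []
    while TASK[state]:
        action, (reward, nxt) = max(TASK[state].items(), key=lambda kv: kv[1][0])
        total, path, state = total + reward, path + [action], nxt
    return total, path


def lookahead_plan(state: str) -> Tuple[float, List[str]]:
    """Score every continuation to the end and keep the best total."""
    best: Tuple[float, List[str]] = (0.0, [])
    for action, (reward, nxt) in TASK[state].items():
        future, rest = lookahead_plan(nxt)
        candidate = (reward + future, [action] + rest)
        if not best[1] or candidate[0] > best[0]:
            best = candidate
    return best


if __name__ == "__main__":
    print("greedy:   ", greedy_plan("start"))
    print("lookahead:", lookahead_plan("start"))
```

Every individual greedy choice here is defensible on its own terms. The loss comes entirely from never weighing the first step against what it rules out later.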

💡 The counter-intuitive result: small + planning beats large + greedy

The paper's most striking demonstration: with their lookahead method (FLARE), “LLaMA-8B with FLARE frequently outperforms GPT-4o with standard step-by-step reasoning” on long-horizon benchmarks. A small model that genuinely plans can beat a much larger model that merely reasons step by step.

The research lesson is not “use LLaMA-8B”. It is that reasoning is not planning. When you see an agent handle each individual step competently and still walk the task into a wall, you are watching greedy step-wise reasoning fail at planning — a structural limitation, not a bug that more model scale will fix.

📈 Watching Agents Run Long: Two 2026 Benchmarks

A new class of benchmark in 2026 stopped measuring single answers and started measuring whether an agent can hold a task together over a long horizon. Two are worth knowing, both because of what they measure and because they make the structural failures visible.

YC-Bench — run a startup for a (simulated) year

YC-Bench (arXiv:2604.01212) tasks an agent with running a simulated startup over a one-year horizon spanning hundreds of turns — managing employees, choosing contracts, and staying profitable while adversarial clients and rising payroll create compounding consequences. It is long-horizon planning made measurable.

The top result was Claude Opus 4.6 at $1.27M average final funds, with the much cheaper GLM-5 (from China's Zhipu AI) close behind at $1.21M for roughly one-eleventh of the inference cost. This previews the lower-cost and Chinese-model story we develop in 10.5.

The “Meltdown Onset Point”

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents (arXiv:2603.29231) evaluated 10 models across 23,392 episodes and introduced metrics for how agents fall apart over time — including a Meltdown Onset Point that detects behavioural collapse by watching the agent's tool-call patterns become erratic.

That a benchmark needs a dedicated “meltdown” metric tells you the phenomenon is common enough to formalise: past some task length, agents do not just get things wrong, they come apart.

💉 Prompt Injection: The Structural Security Hole

One structural failure deserves singling out because it is a security problem, not just a reliability one. An agent reads tool output — web pages, file contents, emails, search results — and acts on it. If an attacker can place text in that content, they can try to issue instructions to your agent. This is prompt injection, and the practitioner who has tracked it most closely, Simon Willison, has made the uncomfortable point repeatedly: we have known about it since 2022 and still have no robust, general defence.

For a researcher this is concrete. If you point an agent at the open web, or at a shared document, or at your inbox, you are trusting every piece of content it reads not to contain hostile instructions — and the agent has no reliable way to tell data from commands, because to a language model they are both just text. The mitigation is not technical cleverness on your side; it is the permissions dial from 10.1. An agent that can only read is far harder to weaponise than one that can also send, delete, or pay.
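
In code, the permissions dial amounts to an allowlist enforced outside the model. The sketch below is a hypothetical illustration, not the API of any real agent framework: the tool names, `ToolGate`, and `PermissionDenied` are all invented. The point is that the check lives in ordinary code, so text injected into a web page cannot change which tools are allowed.

```python
from typing import Callable, Dict, FrozenSet

# Assumed read-only setting: tools that cannot send, delete, or pay.
READ_ONLY: FrozenSet[str] = frozenset({"read_file", "search_web"})


class PermissionDenied(Exception):
    """Raised when the agent requests a tool outside its allowlist."""


class ToolGate:
    """Dispatches tool calls, but only for tools on an explicit allowlist."""

    def __init__(self, tools: Dict[str, Callable[..., str]], allowed: FrozenSet[str]) -> None:
        self.tools = tools
        self.allowed = allowed

    def call(self, name: str, **kwargs: str) -> str:
        # The allowlist is checked here, not by the model, so no prompt can override it.
        if name not in self.allowed:
            raise PermissionDenied(f"tool '{name}' is not permitted at this setting")
        return self.tools[name](**kwargs)


if __name__ == "__main__":
    # Stub tools standing in for real ones.
    tools: Dict[str, Callable[..., str]] = {
        "read_file": lambda path: f"(contents of {path})",
        "search_web": lambda query: f"(results for {query})",
        "send_email": lambda to, body: f"sent to {to}",
    }
    gate = ToolGate(tools, allowed=READ_ONLY)
    print(gate.call("read_file", path="notes.txt"))
    try:
        gate.call("send_email", to="someone@example.org", body="hi")
    except PermissionDenied as err:
        print("blocked:", err)
```

A gate like this does not stop injection itself. It limits what a hijacked agent can do, and even read-only tools deserve thought, since a search query can carry information out.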

🌕 An African-context note: where the agent acts matters

Prompt injection and the permissions dial are not abstractions when an agent touches real systems. An agent with access to a university inbox, a grant portal, or institutional data is acting inside systems governed by South Africa's Protection of Personal Information Act (POPIA). The structural insecurity of agents is one more reason the verification and permissioning burden cannot be delegated — a theme we return to concretely in the Week 10 activity, where you will be asked to state where your tools send data before you use them.

🧯 Putting the Taxonomy to Work

As in Week 9, the point of the taxonomy is action. When you observe an agent failure, locate it — the category tells you what to do.

Observed failure	Category	What to do
Agent emits a broken tool call or forgets it has a tool	Patched	You're on an old model or thin harness. Switch to a current frontier tool.
Agent gets stuck in a loop or re-runs a failing step	Reduced but persistent	Use a harness with loop-detection; cap the number of steps; watch it rather than walking away.
Agent deletes or changes more than you asked	Reduced but persistent	Turn down the permissions dial. Run read-only or in a sandbox until you trust the task.
Agent succeeds once; you assume it always will	Structural (consistency)	Don't. Outcome consistency can be 30–75%. Re-run, or verify the output directly.
Agent handles each step well but walks the whole task into a wall	Structural (planning)	Long-horizon planning collapse. Break the task into shorter, checkpointed pieces you verify between.
Agent acts on a web page / email / document that contained instructions	Structural (prompt injection)	Assume any content the agent reads may be hostile. Restrict permissions; never give a web-reading agent send/pay/delete rights you wouldn't give a stranger.
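
The advice to break a long task into checkpointed pieces can be sketched as a small driver loop. Everything here is an illustrative assumption: the `Stage` shape, the function name, and the retry limit are invented, and in practice the `verify` step would be you, a test suite, or some other check that does not rely on the agent's own confidence.

```python
from typing import Callable, List, Tuple

# (stage name, function that does the work, function that checks the result)
Stage = Tuple[str, Callable[[str], str], Callable[[str], bool]]


def run_checkpointed(stages: List[Stage], initial: str, max_attempts: int = 2) -> str:
    """Run stages in order; only verified output is passed to the next stage."""
    state = initial
    for name, run, verify in stages:
        for _ in range(max_attempts):
            candidate = run(state)
            if verify(candidate):
                state = candidate  # checkpoint: accept this result and move on
                break
        else:
            # No attempt passed verification, so stop instead of building on a bad result.
            raise RuntimeError(f"stage '{name}' failed verification {max_attempts} times")
    return state


if __name__ == "__main__":
    # Toy stages standing in for agent work and human or automated checks.
    stages: List[Stage] = [
        ("normalise", str.upper, str.isupper),
        ("finalise", lambda s: s + "!", lambda s: s.endswith("!")),
    ]
    print(run_checkpointed(stages, "draft summary"))
```

Stopping at the first unverified stage is the design choice that matters: it keeps an early mistake from being carried forward into every later step.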

The skill is the taxonomy, not the examples

The specific tools and figures in this sub-lesson will date within months. What persists is the move: when an agent fails, ask whether the failure is patched (you're using the wrong tool), reduced-but-persistent (manage it with harness settings and supervision), or structural (design your workflow around it, because the next release will not save you).

And the one structural fact to carry forward: accuracy is not reliability. More capable agents are not automatically more trustworthy ones. The trust has to be earned per task, by you, through verification.

📖 Sources & Further Reading

Rabanser, S., Kapoor, S., Kirgis, P., Liu, K., Utpala, S., & Narayanan, A. (2026). Towards a Science of AI Agent Reliability. arXiv:2602.16666 (Princeton CITP) — reliability vs accuracy; four dimensions; the predictability finding.
Wang, Z., et al. (2026). Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents. arXiv:2601.22311 — planning collapse is structural; FLARE; small-plus-planning beats large-plus-greedy.
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution. arXiv:2604.01212 — the simulated-startup long-horizon benchmark.
Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents. arXiv:2603.29231 — reliability-decay and the Meltdown Onset Point.
Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why Language Models Hallucinate. arXiv:2509.04664 — the single-turn parent of the agent reliability problem (from Week 9).
Simon Willison on prompt injection — the standing argument that we still lack a robust general defence.

👉 What Comes Next

Sub-Lesson 10.3 — The Current Tool Landscape (Including MCP). Having defined agents (10.1) and mapped how they fail (10.2), we get concrete about the tools themselves: the coding, computer-use, research, and general agents available in May 2026, and the Model Context Protocol (MCP) that increasingly connects them to your data. We carry the permissions-dial and reliability lessons straight into that tour — because choosing an agent tool is, in large part, choosing how much you are willing to let it do on your behalf.